Arabic supervised learning method using N-gram

Authors

  • Majed Sanan
  • Mahmoud Rammal
  • Khaldoun Zreik
Abstract

Purpose – Classifying Arabic documents has recently become a real problem for juridical centers. In the case studied, some of the Lebanese official journal documents are already classified, and the center has to classify new documents on the basis of these. This paper aims to study and explain the application of a supervised learning method to Arabic texts, using N-grams as the indexing method (n = 3).

Design/methodology/approach – The Lebanese official journal documents are categorized into several classes. Given that the class(es) of some documents (called learning texts) are known, segmenting these documents helps determine the candidate words of each class.

Findings – Results showed that N-gram text classification using the cosine coefficient measure outperforms classification using Dice’s measure and the TF*ICF weight. It is therefore the best of the three measures, but it is still insufficient. The N-gram method is good but not sufficient for classifying Arabic documents, so future work needs to consider a new approach, such as a distributional or symbolic approach, in order to increase effectiveness.

Originality/value – The results could be used to improve Arabic document classification (including in software). This work evaluated a number of similarity measures for the classification of Arabic documents, using the Lebanese parliament documents, and especially the Lebanese official journal Arabic corpus, as the test bed.
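
The indexing and comparison steps named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes character-level trigrams (the abstract only states n = 3), builds one merged profile per class from the learning texts, and compares a new document to each profile with the cosine coefficient or Dice's measure. The TF*ICF weighting is not shown. All function names, class labels, and the toy texts are hypothetical.

```python
from collections import Counter
from math import sqrt


def char_ngrams(text, n=3):
    """Count overlapping character n-grams (assumption: character level, n = 3)."""
    text = " ".join(text.split())  # collapse runs of whitespace
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def cosine(a, b):
    """Cosine coefficient between two n-gram frequency profiles."""
    dot = sum(a[g] * b[g] for g in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def dice(a, b):
    """Dice's measure computed on the sets of distinct n-grams."""
    if not a or not b:
        return 0.0
    shared = len(a.keys() & b.keys())
    return 2 * shared / (len(a) + len(b))


def build_profiles(labelled_texts):
    """Merge the n-gram counts of all learning texts that share a class label."""
    profiles = {}
    for label, text in labelled_texts:
        profiles.setdefault(label, Counter()).update(char_ngrams(text))
    return profiles


def classify(text, profiles, measure=cosine):
    """Return the class whose profile is most similar to the text."""
    grams = char_ngrams(text)
    return max(profiles, key=lambda label: measure(grams, profiles[label]))


# Toy learning texts (hypothetical, not taken from the Lebanese official journal).
learning_texts = [
    ("tax", "قانون الضرائب والرسوم المالية"),
    ("staff", "مرسوم تعيين موظفين في الادارة"),
]
profiles = build_profiles(learning_texts)
new_text = "تعديل قانون الضرائب"

print(classify(new_text, profiles, cosine))
print(classify(new_text, profiles, dice))
```

On this toy data both measures should select the "tax" class, because the new text shares most of its trigrams with the tax learning text. A real experiment would need a much larger labelled corpus and an explicit evaluation of each measure.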

Similar resources

Tw-StAR at SemEval-2017 Task 4: Sentiment Classification of Arabic Tweets

In this paper, we present our contribution to the SemEval 2017 international workshop. We have tackled task 4 entitled “Sentiment analysis in Twitter”, specifically subtask 4A-Arabic. We propose two Arabic sentiment classification models implemented using supervised and unsupervised learning strategies. In both models, Arabic tweets were preprocessed first, then various schemes of bag-of-N-grams wer...

Identification of Languages in Algerian Arabic Multilingual Documents

This paper presents a language identification system designed to detect the language of each word, in its context, in multilingual documents as generated in social media by bilingual/multilingual communities, in our case speakers of Algerian Arabic. We frame the task as a sequence tagging problem and use supervised machine learning with standard methods like HMM and Ngram classification taggi...

Sentiment Classification of Arabic Documents: Experiments with multi-type features and ensemble algorithms

Document sentiment classification is often processed by applying machine learning techniques, in particular supervised learning, which basically consists of two major steps: feature extraction and training the learning model. In the literature, most existing research relies on n-grams as selected features, and on a simple basic classifier as the learning model. In the context of our work, we try to ...

QCRI: Answer Selection for Community Question Answering - Experiments for Arabic and English

This paper describes QCRI’s participation in SemEval-2015 Task 3 “Answer Selection in Community Question Answering”, which targeted real-life Web forums, and was offered in both Arabic and English. We apply a supervised machine learning approach considering a manifold of features including among others word n-grams, text similarity, sentiment analysis, the presence of specific words, and the co...

Romanized Berber and Romanized Arabic Automatic Language Identification Using Machine Learning

The identification of the language of text/speech input is the first step in properly doing any language-dependent natural language processing. The task is called Automatic Language Identification (ALI). As a well-studied field since the early 1960s, it has seen various methods applied to many standard languages. The ALI standard methods require datasets for training and use character/wor...

Journal title:
  • Interact. Techn. Smart Edu.

Volume 5

Publication date: 2008